{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 搭建一个简单的问答系统"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本次项目的目标是搭建一个基于检索式的简单的问答系统。 \n",
    "\n",
    "知识点：\n",
    "1. 字符串操作   2. 文本预处理技术（词过滤，标准化）   3. 文本的表示（tf-idf, word2vec)  4. 文本相似度计算  5. 文本高效检索\n",
    "\n",
    "此项目需要的数据：\n",
    "1. [train-v2.0.json](https://github.com/SixGreen/small-programs/blob/master/data/train-v2.0.json): 这个数据包含了问题和答案的pair， 但是以JSON格式存在，需要编写parser来提取出里面的问题和答案。 \n",
    "2. [glove.6B](https://nlp.stanford.edu/projects/glove/): 使用d=100的词向量\n",
    "\n",
    "**检索式的问答系统**\n",
    "\n",
    "问答系统所需要的数据已经提供，对于每一个问题都可以找得到相应的答案，所以可以理解为每一个样本数据是 <问题、答案>。 那系统的核心是当用户输入一个问题的时候，首先要找到跟这个问题最相近的已经存储在库里的问题，然后直接返回相应的答案即可。 举一个简单的例子：\n",
    "\n",
    "假设我们的库里面已有存在以下几个<问题,答案>：\n",
    "<\"人工智能和机器学习的关系什么？\", \"其实机器学习是人工智能的一个范畴，很多人工智能的应用要基于机器学习的技术\">\n",
    "<\"人工智能最核心的语言是什么？\"， \"Python\">\n",
    ".....\n",
    "\n",
    "假设一个用户往系统中输入了问题 “机器学习与人工智能有什么关系？”， 那这时候系统先去匹配最相近的“已经存在库里的”问题。 那在这里很显然是 “机器学习与人工智能有什么关系”和“人工智能和机器学习的关系什么”是最相近的。 所以当我们定位到这个问题之后，直接返回它的答案 “其实机器学习是人工智能的一个范畴，很多人工智能的应用要基于机器学习的技术”就可以了。所以这里的核心问题可以归结为计算两个问句（query）之间的相似度。\n",
    "\n",
    "在本次项目中，你会频繁地使用到[sklearn](http://scikit-learn.org/stable/install.html)这个机器学习库，sklearn包含了各类机器学习算法和数据处理工具，包括本项目需要使用的词袋模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 读取文件，并把内容分别写到两个list里（一个list对应问题集，另一个list对应答案集）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def read_corpus(filepath):\n",
    "    \"\"\"\n",
    "    读取给定的语料库，并把问题列表和答案列表分别写入到 qlist, alist 里面。 在此过程中，不用对字符换做任何的处理（这部分要在后面处理）\n",
    "    qlist = [\"问题1\"， “问题2”， “问题3” ....]\n",
    "    alist = [\"答案1\", \"答案2\", \"答案3\" ....]\n",
    "    务必要让每一个问题和答案对应起来（下标位置一致）\n",
    "    \"\"\"\n",
    "    with open(filepath) as f:\n",
    "        data = json.load(f)\n",
    "        \n",
    "    qlist = list()\n",
    "    alist = list()\n",
    "    for item in data['data']:\n",
    "        for para in item['paragraphs']:\n",
    "            for qa in para['qas']:\n",
    "                qlist.append(qa['question'])\n",
    "                # 部分answers的list为空，所以会引发IndexError\n",
    "                try:\n",
    "                    alist.append(qa['answers'][0]['text'])\n",
    "                except IndexError:\n",
    "                    qlist.pop()\n",
    "        \n",
    "    assert len(qlist) == len(alist)  # 确保长度一样\n",
    "    return qlist, alist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "问答数量：86821\n",
      "['With what Belorussian city does Kathmandu have a relationship?', 'In what year did Kathmandu create its initial international relationship?', 'What is KMC an initialism of?']\n",
      "['Minsk', '1975', 'Kathmandu Metropolitan City']\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "qlist, alist = read_corpus('./data/train-v2.0.json')\n",
    "print(\"问答数量：%d\" % len(qtmp))\n",
    "print(qlist[-3:])\n",
    "print(alist[-3:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 理解数据（可视化分析/统计信息）\n",
    "对数据的理解是任何AI工作的第一步，需要充分对手上的数据有个更直观的理解。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAD8CAYAAABpcuN4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XuYVOWd4PHvr6r6Ak03zaVpGhoEFLW7MeHSK2Q0KhoFiQOO4gxZNrIODvs4Jisms5FsdjfZySRBVxMxMeYxilHHGeKQyUC8jEvAaOJ6ay6i0NwElAZs0Aaae1/qt3+ct5oCqqurb3WqT/8+z1NPnfOet855T/V7+lfvec95j6gqxhhjTGtCfhfAGGNMZrNAYYwxJikLFMYYY5KyQGGMMSYpCxTGGGOSskBhjDEmKQsUxhhjkrJAYYwxJikLFMYYY5KK+F2Ajho8eLCOGjXK72KYgFq7du2nqlqU7u1avTbdqaP1uscGilGjRlFVVeV3MUxAichHfmzX6rXpTh2t13bqyRhjTFIpBQoRKRSR5SKyRUSqReQLIjJQRFaJyHb3PsDlFRF5RER2iMhGEZkYt555Lv92EZkXlz5JRN53n3lERKTrd9WY8x0+fJjZs2dz6aWXUlZWxptvvkldXR3AWKvbxnhSbVEsAf5dVS8FPg9UA4uA1ao6Fljt5gFuBMa61wLgMQARGQh8F5gMXA58N3YAujwL4j43vSM78+7uOi7/we9Z9/Ghjnzc9EL33HMP06dPZ8uWLbz33nuUlZWxePFigKOZVLf/14oPuOXnb3Tko8Z0WpuBQkQKgKuAJwFUtUFVDwOzgKddtqeBm930LOAZ9bwFFIpICTANWKWqdap6CFgFTHfLClT1TfXGPH8mbl3t0tgU5cDR0zQ2RTvycdPL1NfX8/rrrzN//nwAsrOzKSwsZMWKFQCfuWwZUbfrTzby6bGGDu6pMZ2TSotiDHAQeEpE1ovIEyKSBxSr6n4A9z7E5R8O7In7fI1LS5ZekyD9PCKyQESqRKTq4MGDKRTdmNbt3LmToqIi7rjjDiZMmMCdd97J8ePHqa2tBWiE9NTtVOq1iKDYs2OMP1IJFBFgIvCYqk4AjnOmKZ5IonOw2oH08xNVH1fVSlWtLCpq/QovO5xMKpqamli3bh133XUX69evJy8vL3baqTXdUrdTqdcC2DPGjF9SCRQ1QI2qvu3ml+MFjlrXtMa9H4jLPyLu86XAvjbSSxOkt591E5p2KC0tpbS0lMmTJwMwe/Zs1q1bR3FxMUAWZFbdtkBh/NJmoFDVT4A9InKJS7oO2AysBGJXd8wDVrjplcDt7gqRKcAR13x/BbhBRAa4jr4bgFfcsqMiMsVdEXJ73Lo6xA4ok4qhQ4cyYsQItm7dCsDq1aspLy9n5syZAINctoyo22K/goyPUr3h7uvAcyKSDewE7sALMs+LyHzgY+A2l/clYAawAzjh8qKqdSLyfeBdl+/vVbXOTd8F/AroA7zsXu1mB5Npr5/+9KfMnTuXhoYGxowZw1NPPUU0GuXBBx8sEJHtZErdFrDn2xu/pBQoVHUDUJlg0XUJ8ipwdyvrWQosTZBeBYxLpSzGdKXx48e3dif0NlU9q877WbcF63sz/gnkndl2dYgJGrE+CuOjQAUKu+fVBJWdVjV+ClSgMCbIrKVs/BLMQGHHkwkYO/Vk/BSoQGGNcxNUIvb7x/gnUIHCmOASa1EY3wQyUNjxZILGu1DDarbxR6AChQ31b4LKxnoyfgpUoDAmqKyPwvgpkIHCfnmZoBHEhvAwvglUoLAzTyaorEVh/BSoQGFMUFkfhfFTIAOF3cFqgsYu1DB+ClSgsEPJBJn1URi/BCpQGBNkFiaMXwIZKOyHlwkasQdSGB8FKlDYaVwTVIJYnDC+CVSgMCao7FGoxk+BDBR2OJmgsTNPxk8BCxR27skEkz2Pwvgp5UAhImERWS8iL7j50SLytohsF5Ffi0i2S89x8zvc8lFx6/i2S98qItPi0qe7tB0isqjrds+Y5Jqbm5kwYQI33XQTALt27WLy5MkA4zKpXouI3R9kfNOeFsU9QHXc/P3AT1R1LHAImO/S5wOHVPUi4CcuHyJSDswBKoDpwM9d8AkDjwI3AuXAV1zeDrNzuSZVS5YsoaysrGX+vvvu49577wX4gAyq13ZntvFTSoFCREqBLwNPuHkBrgWWuyxPAze76VluHrf8Opd/FrBMVU+r6i5gB3C5e+1Q1Z2q2gAsc3nbza56Mu1RU1PDiy++yJ133gl4PzDWrFnD7NmzY1kyol4DdlbV+CrVFsXDwLeAqJsfBBxW1SY3XwMMd9PDgT0AbvkRl78l/ZzPtJZuTLdauHAhDzzwAKGQdxh89tlnFBYWEolEYlkyql5bg8L4pc1AISI3AQdUdW18coKs2say9qYnKssCEakSkaqDBw+2WmY7oExbXnjhBYYMGcKkSZNa0lo5ZZkR9Vqw4WONfyJtZ+EKYKaIzABygQK8FkahiETcr6tSYJ/LXwOMAGpEJAL0B+ri0mPiP9Na+llU9XHgcYDKysrzDhtrnZtUvfHGG6xcuZKXXnqJU6dOUV9fz8KFCzl8+DBNTbGGcmbUa4gNM26RwvijzRaFqn5bVUtVdRRep90aVZ0LvArETubOA1a46ZVuHrd8jXo/1VYCc9zVI6OBscA7wLvAWHcVVbbbxsou2TtjWvGjH/2Impoadu/ezbJly7j22mt57rnnmDp1KsuXx7reMqdeW2e28VMqLYrW3AcsE5F/ANYDT7r0J4FnRWQH3i+uOQCquklEngc2A03A3araDCAiXwNeAcLAUlXd1IlyWRPddNj999/PnDlzAMYBu8iQem0PLjJ+alegUNU/AH9w0zvxruw4N88p4LZWPv8D4AcJ0l8CXmpPWRKxMftNR1xzzTVcc801AIwZM4Z33nkHEflAVVvqsZ/1GuxRqMZfAbsz25hgshaF8VMgA4V1+pmgsT4K46dABQo78WQCy06rGh8FKlAYE1SxMGH9FMYPgQwUdiyZoLEGhfFToAKFHUwm6OxHkPFDoAKFMUEl7uSTxQnjh0AGCvvVZYIm1lq2Pgrjh0AFCrHrnkxAtXRm+1oK01sFKlAYE1RnWhT+lsP0ToEMFHYsmaCJDU9jN5MaPwQqUNhVTyborEVh/BCoQGFMUNmPIOOnQAYKuzLEBE3L5bFWtY0PAhkojAkaa1EYP1mgMKYHsc5s44dABgo7lEzQnBkU0NdimF4qUIHCmucmqFruo/C3GKaXClSgMCaoznRmW6gw6RfIQGHHkgkaa1EYPwUqUNhYTybo7EeQ8UObgUJERojIqyJSLSKbROQelz5QRFaJyHb3PsCli4g8IiI7RGSjiEyMW9c8l3+7iMyLS58kIu+7zzwiYr0Npvvt2bOHqVOnUlZWRkVFBUuWLAGgrq4OYGwm1e2Q2Kkn459UWhRNwDdVtQyYAtwtIuXAImC1qo4FVrt5gBuBse61AHgMvMACfBeYDFwOfDd2ALo8C+I+N71zu2UHk2lbJBLhoYceorq6mrfeeotHH32UzZs3s3jxYoCjmVS3s8JeoGiKWt026ddmoFDV/aq6zk0fBaqB4cAs4GmX7WngZjc9C3hGPW8BhSJSAkwDVqlqnaoeAlYB092yAlV9U72fS8/EratdrB1i2qOkpISJE71GQX5+PmVlZezdu5cVK1YAfOayZUTdzgp7h2pDU7QjHzemU9rVRyEio4AJwNtAsaruBy+YAENctuHAnriP1bi0ZOk1CdITbX+BiFSJSNXBgwdbLae1zk177d69m/Xr1zN58mRqa2sBGiE9dTuVep0d8Q7VxmYLFCb9Ug4UItIP+A2wUFXrk2VNkKYdSD8/UfVxVa1U1cqioqIEZUxSKmNacezYMW699VYefvhhCgoKkmXtlrrdVr0Ga1EYf6UUKEQkCy9IPKeq/+qSa13TGvd+wKXXACPiPl4K7GsjvTRBujHdrrGxkVtvvZW5c+dyyy23AFBcXAyQBZlTt2MtigZrURgfpHLVkwBPAtWq+uO4RSuB2NUd84AVcem3uytEpgBHXPP9FeAGERngOvpuAF5xy46KyBS3rdvj1tUhdubJpEJVmT9/PmVlZXzjG99oSZ85cybAIDebEXU721oUxkeRFPJcAXwVeF9ENri0/w4sBp4XkfnAx8BtbtlLwAxgB3ACuANAVetE5PvAuy7f36tqnZu+C/gV0Ad42b3aze6jMO3xxhtv8Oyzz3LZZZcxfvx4AH74wx+yaNEiHnzwwQIR2U6G1O0zfRT2M8ikX5uBQlX/ROJzrQDXJcivwN2trGspsDRBehUwrq2yGNOVrrzyymT3JWxT1cr4BD/rtvVRGD8F6s7sGLvqyQRNfq73m67+VKPPJTG9UaAChV31ZIKqb3YYgJMNzT6XxPRGgQoUxgSVXfVk/BTIQGFPATNBkxPxWhTWR2H8EKhAYWeeTFC1tCgsUBgfBCpQGBNUsfsoTjdZH4VJv0AGCrvqyQRNbPRYa1EYPwQqUNhVTyaoRITsSIjT1pltfBCoQGFMkOWEQ9aiML4IZKCwM08miLIjFiiMPwIWKOzckwkuCxTGLwELFMYEV3YkxGkLFMYHgQwU9gB6E0Q5kRDHTzf5XQzTCwUqUNhVTybI+vfJou5Eg9/FML1QoAKFMUE2uF8Ox05Zi8KknwUKY3qI/NwInx23FoVJv0AFCjvzZIJs5MC+1B1vsKHGTdoFKlAYE2Rji/MBeG3bQZ9LYnqbQAYKu+jJBNF1lw6hdEAf/vGtj/wuiullAhUoxC57MgEWCYf4Ulkxf9rxKXsPn/S7OKYXyZhAISLTRWSriOwQkUV+l8eYrtKVdfurX7gAgOVVNV1SNmNSEfG7AAAiEgYeBa4HaoB3RWSlqm5uz3oiIa9FYWP2m0zRVXU75sKiftxQXszPXt3OvsMnmTl+GGUlBQzMy+7KYhtzlowIFMDlwA5V3QkgIsuAWUC7DqaS/rlkR0Ks2nyACwblkZcdIScrRE4kRFY4RCiFU1OpnL1K6QRXSutJnqmrypLKKbnU1pNCntR2vAu2k0KeNlYUCQm5WeEU1tQpXVK34/3wlsu4/+Ut/NuGvfy6ag8AQ/JzKOmfS3FBLn2zw0TCISIhIRwSssIhwiEhEhIiYSEcCrVMe3lCZIXlTJ5QyOVz0yEhHBayQm49YWk5lkS8v0Xsu5aWtDPLz3pHvOXnzhOfT+KWe+uOXy5xy1vS4rYfv73YNuKdWy3OqyUJqk1713Fu3Tt/efL1nzebsEytb7NPVphwqOtOxWdKoBgO7ImbrwEmt3clkXCIGeOG8m8b9vH76touK5wJprmTR/KDv7isuzfTJXU73uB+Ofyf2z7P/7ipnI01h9m8r54dB47xSf0pdn92nFONUZqjSmOz994UVZqaozRFtWXeBNuKu6/g8yMKu2x9mRIoEoW+82qziCwAFgCMHDky4Yp+8lfj+a/XjWX/kVOcbGjmdFOU003NNDRF2xx+PJWrpTSFQcxTW0/nV5LK4Z5SWdK5rS4oSypSWc2lJfldsq02tFm3U6nXifTvk8UXxxbxxbFF7SqQ6pmA0RRVmpuVpmj0rPnGqAsycctiwaep2TsKVN3RoLGdUlTPfPdn5SGWrucsP3NMedNn14Hzlitx6zt/+2e2c2YbZ+/7OfPn5EhUb85LOidTm9tob/7zlrf/f05JYW6bn2mPTAkUNcCIuPlSYN+5mVT1ceBxgMrKyoTfnogwpqgfY4r6dUc5jWmvNut2KvW6K4m4007dftbNBEWmXPX0LjBWREaLSDYwB1jpc5mM6QpWt02PlxEtClVtEpGvAa8AYWCpqm7yuVjGdJrVbRME0lOf3SAiB4FEt6gOBj5Nc3HSIYj7lcn7dIGqtu/kfxdIUq8hs76vTClLppQDekZZOlSve2ygaI2IVKlqpd/l6GpB3K8g7lN3yqTvK1PKkinlgGCXJVP6KIwxxmQoCxTGGGOSCmKgeNzvAnSTIO5XEPepO2XS95UpZcmUckCAyxK4PgpjjDFdK4gtCmOMMV3IAoUxxpikAhUoMv2ZFiIyQkReFZFqEdkkIve49IEiskpEtrv3AS5dROQRtz8bRWRi3LrmufzbRWReXPokEXnffeYRSdPTnEQkLCLrReQFNz9aRN525fu1uysZEclx8zvc8lFx6/i2S98qItPi0jP675oO6f4ORGS3q0cbRKTKpbW7nnZw20tF5ICIfBCX1mXHSCfL8T0R2eu+lw0iMiNuWbfV33T870hKVQPxwrvr9UNgDJANvAeU+12uc8pYAkx00/nANqAceABY5NIXAfe76RnAy3gDy00B3nbpA4Gd7n2Amx7glr0DfMF95mXgxjTt2zeAfwJecPPPA3Pc9C+Au9z03wK/cNNzgF+76XL3N8sBRru/Zbgn/F3T8N2m/TsAdgODz0lrVz3txLavAiYCH3R028mOkU6W43vA3yXI2631Nx3/O5K9gtSiaBn3X1UbgNi4/xlDVfer6jo3fRSoxhuGehbwtMv2NHCzm54FPKOet4BCESkBpgGrVLVOVQ8Bq4DpblmBqr6pXq14Jm5d3UZESoEvA0+4eQGuBZa3sk+xfV0OXOfyzwKWqeppVd0F7MD7m2b83zUNMuU7aG897RBVfR2o6+S2Ex4jXVCO1nRr/e3u/x1tbT9IgSLRuP/DfSpLm9wplwnA20Cxqu4Hr0IAQ1y21vYpWXpNgvTu9jDwLSDq5gcBh1W1KUE5Wsrulh9x+du7r72JH9+BAv9XRNaKNww6tL+edqWuOka6wtfc6ZylsVM96SxHN/3vSCpIgSKlZ1pkAhHpB/wGWKiq9cmyJkjTDqR3GxG5CTigqmvjk5OUI+P3KQP58R1coaoTgRuBu0XkqiR5/fwbpbvePAZcCIwH9gMPpbMc3fi/I6kgBYqUnmnhNxHJwvtDP6eq/+qSa2NNdfd+wKW3tk/J0ksTpHenK4CZIrIbr1l9LV4Lo1BEYqMTx5ejpexueX+85n1797U3Sft3oKr73PsB4Ld4p1DaW0+7UlcdI52iqrWq2qyqUeCXeN9LWsrRzf87kmtvp0qmvvCGTN+J15EU6zSq8Ltc55RR8PoNHj4n/f9wdofUA276y5zdIfWOnumQ2oXXGTXATQ90y951eWOd2TPSuH/XcKYz+184uzP7b9303Zzdmf28m67g7M7AnXgdgRn/d03D95rW7wDIA/Ljpv8f3nnsdtXTTpZhFGd3InfZMdLJcpTETd+L1y/R7fU3Hf87km7f74Ogiyv4DLyrAT4EvuN3eRKU70q8Zt5GYIN7zcA7R78a2O7eY//0BXjU7c/7QGXcuv4ar8NsB3BHXHol8IH7zM9wd9+naf+u4UygGIN3BdYOvKCR49Jz3fwOt3xM3Oe/48q9lbirtTL975qm7zZt34H7273nXpti2+tIPe3g9v8Z77ROI94v4PldeYx0shzPuu1sxHsAVXzg6Lb6m47/HcleNoSHMcaYpILUR2GMMaYbWKAwxhiTlAUKY4wxSUXazpKZBg8erKNGjfK7GCag1q5d+6n68Mxsq9emO3W0XvfYQDFq1Ciqqqr8LoYJKBH5yI/tWr023amj9TqlU08iUigiy0Vkixu98AtdOWqh+DTiqTGHDx9m9uzZXHrppZSVlfHmm29SV1cHMNbqtjGeVPsolgD/rqqXAp/HG5BqEbBaVcfiXb8bGz73RmCsey3Au+UdERkIfBeYjHc343fjxkl5zOWNfa5dg3fF1B1vYHV1LYdPNHTk46YXuueee5g+fTpbtmzhvffeo6ysjMWLFwMczaS6vb32KH/cfrAjHzWm09oMFCJSgDfc7pMAqtqgqofJwBFPq/fXM//pKrZ+crQjHze9TH19Pa+//jrz588HIDs7m8LCQlasWAHwmcuWEXX7p2t2cN/yjR3cU2M6J5UWxRjgIPCUeA+meUJE8sjAEU8jIa9V3xS1mwhN23bu3ElRURF33HEHEyZM4M477+T48ePU1taCdzduxtTtimEF7DtyikPHrbVs0i+VQBHBe3jHY6o6ATjOmaZ4It02OqiILBCRKhGpOnjw/GZ4JOztTmNz9LxlxpyrqamJdevWcdddd7F+/Xry8vJip51a0y11u616DVAxrD8Am/YlGzDUmO6RSqCoAWpU9W03vxwvcKR9xFNVfVxVK1W1sqjo/Cu8ssKuRdFsLQrTttLSUkpLS5k8eTIAs2fPZt26dRQXFwNkQXrqdlv1GrwWBcCmfUfat5PGdIE2A4WqfgLsEZFLXNJ1wGa8AbFiV3fMA1a46ZXA7e4KkSnAEdd8fwW4QUQGuI6+G4BX3LKjIjLFXRFye9y62iUS8nanKWotCtO2oUOHMmLECLZu3QrA6tWrKS8vZ+bMmeANtgYZUrcH5GUzrH+utSiML1K9j+LrwHMiko03ZO4deEHmeRGZD3wM3ObyvoQ3quEO4ITLi6rWicj38YbBBvh7VY09ZvAu4FdAH7yhcV/uyM7EWhSN1qIwKfrpT3/K3LlzaWhoYMyYMTz11FNEo1EefPDBAhHZTobUbYDyYf2tRWF8kVKgUNUNeMNXn+u6BHkV75kDidazFFiaIL0KGJdKWZKJ9VFYi8Kkavz48a3d4LZNVc+q837WbYBxwwtYvaWWEw1N9M3usffKmh4oUGM9xa56shaFCaKKYf1Rher9dvm3Sa9ABYqsWIvCAoUJoFiH9mY7/WTSLFCBIhK76slOPZkAKumfy4C+WdahbdIuUIEiKxS7j8JaFCZ4RISKYf0tUJi0C1SgaGlR2A13JqAqhhWw9ZOjdlOpSatgBgobwsMEVPmwAhqao+w4cMzvopheJFCB4sypJ/u1ZYIpNpTHB3utQ9ukT6ACRSgkhMSuejLBNXpwHn2ywtZPYdIqUIECvJvuGu2qJxNQ4ZBQVpLPZgsUJo0CFyiyQmItChNoFcP6s3l/PVHrizNpErhAEQmH7KonE2gVwwo4drqJj+tO+F0U00sELlBkhYUGa1GYALNnU5h0C1ygiISsRWGC7eKh/YiExEaSNWkTvEARFruPwgRaTiTMRUP6WYvCpE3gAkVWOGT3UZjAs6E8TDoFLlBE7Kon0wtUDCvg02OnOVB/yu+imF4geIEiHLLRY03gnXmGtrUqTPcLXKDICouNHmsCr7wlUFiHtul+gQsUkZBYi8IEXn5uFhcM6mstCpMWwQsU4ZC1KEyvUDGswAKFSYvABYqssNh9FKZXqBjWn4/rTlB/qtHvopiASzlQiEhYRNaLyAtufrSIvC0i20Xk1yKS7dJz3PwOt3xU3Dq+7dK3isi0uPTpLm2HiCzqzA5FQiG7j8L0CuUtz9C2VoXpXu1pUdwDVMfN3w/8RFXHAoeA+S59PnBIVS8CfuLyISLlwBygApgO/NwFnzDwKHAjUA58xeXtEOvMNr2FXflk0iWlQCEipcCXgSfcvADXAstdlqeBm930LDePW36dyz8LWKaqp1V1F7ADuNy9dqjqTlVtAJa5vB1iQ3iY9mhubmbChAncdNNNAOzatYvJkycDjMuklnIiQ/JzKcrPsSufTLdLtUXxMPAtIPYfeBBwWFWb3HwNMNxNDwf2ALjlR1z+lvRzPtNaeodkRezUk0ndkiVLKCsra5m/7777uPfeewE+IINayq2pGFZgp55Mt2szUIjITcABVV0bn5wgq7axrL3picqyQESqRKTq4MGDCcubFRIbwsOkpKamhhdffJE777wTAFVlzZo1zJ49O5YlY1rKrakYVsD2A8c41djc1as2pkUqLYorgJkishuvsl+L18IoFJGIy1MK7HPTNcAIALe8P1AXn37OZ1pLP4+qPq6qlapaWVRUlLCwkbAN4WFSs3DhQh544AFC7lnrn332GYWFhUQisWqdOS3l1lQM609zVNlWe7SrV21MizYDhap+W1VLVXUUXhN7jarOBV4FYj+95gEr3PRKN49bvkZV1aXPced6RwNjgXeAd4Gx7iqqbLeNlR3dIRvCw6TihRdeYMiQIUyaNKklzaum58mIlnJrrEPbpEOk7Sytug9YJiL/AKwHnnTpTwLPisgOvJbEHABV3SQizwObgSbgblVtBhCRrwGvAGFgqapu6mihvFNP1qIwyb3xxhusXLmSl156iVOnTlFfX8/ChQs5fPgwTU2xrreELeWaFFvKJEk/i6o+DjwOUFlZ2a7KO2JAX/JzItahbbpVu264U9U/qOpNbnqnql6uqhep6m2qetqln3LzF7nlO+M+/wNVvVBVL1HVl+PSX1LVi92yH3Rmhwbm5VB/qpGTDXbO1rTuRz/6ETU1NezevZtly5Zx7bXX8txzzzF16lSWL49dzJc5LeXWhEJCmd2hbbpZZ1oUGenCIXmowq5Pj7fckGRMqu6//37mzJkDMA7YRYa0lJOpGFbAsnf20BxVwqFEZ7yM6ZzADeFxYVE/AD48eMznkpie4pprruGFF14AYMyYMbzzzjsAH2RSSzmZSRcM4GRjM//l2bXU2vMpTDcIXKAYPTgPEQsUpveYMa6E78wo44/bD/KlH7/G8+/uaa1j3pgOCVygyM0KU1KQy8d1J/wuijFpEQoJf3PVGP594VWUlRTwrd9s5Pal71BzyI4B0zUCFygA+uVGrDPb9DqjB+ex7G+m8P1ZFaz96BDTfvI6z765m6iNVGA6KZCBok9WmJN2p6rphUIh4atfGMUrC69i4gUD+J8rNjHnl29x4Kj1XZiOC2SgyM0KW4vC9GojBvblmb++nAdmf44NHx/m569+6HeRTA8WyEDRJztsY9+YXk9E+MvKEVxfUcyKDXtpaLIRC0zHBDJQ5Ebs1JMxMbMnlXLoRCNrthzwuyimhwpkoPBaFPbryRiAL140mCH5OSxfu6ftzMYkEMhAkWud2ca0iIRD3DKxlFe3HuTg0dN+F8f0QIEMFH2ywpyyzmxjWsyeNJzmqLJiw16/i2J6oGAGiuyQtSiMiXPRkHzGjyjkX6pq7K5t027BDBRZYZqiak+6MybObZWlbK09ygd7baRZ0z6BDBS5WWEAu0TWmDg3fW4Y2ZGQdWqbdgt0oLDTT8ac0b9PFtMqhrLivX2cbrJjw6QukIGhOXBHAAARv0lEQVSiT6xF0WCnnoyJN3tSKYdPNLK62u6pMKkLZqDIthaFMYlcedFghhbksnxtjd9FMT1IMAOFnXoyJqFwSLhl4nBe23aQA/aQI5OiQAaKnCxvt2xgQGPON3tSKc1R5bfr7Z4Kk5o2A4WIjBCRV0WkWkQ2icg9Ln2giKwSke3ufYBLFxF5RER2iMhGEZkYt655Lv92EZkXlz5JRN53n3lERDr14N+WPgrrsDPmPGOK+jHpggEsX2v3VJjUpNKiaAK+qaplwBTgbhEpBxYBq1V1LLDazQPcCIx1rwXAY+AFFuC7wGTgcuC7seDi8iyI+9z0zuxUXk4EgKOnmjqzGmMCa/akUrYfOMbGmiN+F8X0AG0GClXdr6rr3PRRoBoYDswCnnbZngZudtOzgGfU8xZQKCIlwDRglarWqeohYBUw3S0rUNU31ft580zcujpkeGEfAPbY41CNSejLnyshJxKyTm2Tknb1UYjIKGAC8DZQrKr7wQsmwBCXbTgQf0dPjUtLll6TIL3D8nIiFOXn8NFnxzuzGmMCqyA3i+njhrJiw167MdW0KeVAISL9gN8AC1U12RgAifoXtAPpicqwQESqRKTq4MGDScs7alBfdn9mLQpjWnPbpBHUn2ri99W1fhfFZLiUAoWIZOEFiedU9V9dcq07bYR7j93BUwOMiPt4KbCvjfTSBOnnUdXHVbVSVSuLioqSlvmCQXnWojBJ7dmzh6lTp1JWVkZFRQVLliwBoK6uDmBsJl6o0ZW+cOEghhf24UcvbWHDnsN+F8dksFSuehLgSaBaVX8ct2glEDsg5gEr4tJvdwfVFOCIOzX1CnCDiAxwB94NwCtu2VERmeK2dXvcujpsTFEetfWn7aHyplWRSISHHnqI6upq3nrrLR599FE2b97M4sWLAY5m4oUaXSkcEn72HycAMPux/8cvXvuQaNSugjLnS6VFcQXwVeBaEdngXjOAxcD1IrIduN7NA7wE7AR2AL8E/hZAVeuA7wPvutffuzSAu4An3Gc+BF7u7I5dd2kxAP/+wSedXZUJqJKSEiZO9BoF+fn5lJWVsXfvXlasWAHwmcuWURdqdLUJIwfw0n/9IteXF7P45S3Me+od+3FlzhNpK4Oq/onE/QgA1yXIr8DdraxrKbA0QXoVMK6tsrTHJUPzuaQ4n396+2P+0+QLCIUypsVvMtDu3btZv349kydPpra2FqARvAs1RKRbL9QQkQV4rQ5GjhzZJfvTHv37ZvHzuRNZ9u4e/vfvNnHjw3/kob/8PNdcMqTtD5teIZB3Zsf87dQL2fLJUf7x7Y/8LorJYMeOHePWW2/l4YcfpqCgIFnWbrlQoz19b91FRPjK5SP53deuZHC/HP7zU+/ygxc309BkA2uagAeKP//cMK6+uIjvv7CZT4/Zs4LN+RobG7n11luZO3cut9xyCwDFxcUAWZC+CzUyxdjifFZ87Qq+OuUCfvnHXcx45I8s+f12Pth7xO7i7sUCHShCIeGuay6ksVl5f6/dgWrOpqrMnz+fsrIyvvGNb7Skz5w5E2CQm824CzW6W25WmO/fPI7HvzqJ/NwID6/exk0//RN/tngN3/nt+7y65YDde9HLtNlH0dOVDfVOJVTvr2eqnXM1cd544w2effZZLrvsMsaPHw/AD3/4QxYtWsSDDz5Y4C7U+Bi4zX3kJWAG3kUXJ4A7wLtQQ0RiF2rA+Rdq/Arog3eRRqcv1EiXGyqGckPFUA4ePc2rWw+wurqW367fy3Nvf0yfrDBXXDSYCSMLKR9WQEVJAUX5OWTQ1b+mCwU+UPTvm8Xwwj5U7z/qd1FMhrnyyiuTnU7ZpqqV8QmZcqFGuhXl5/CXlSP4y8oRnGps5u1ddayuruW1bQfPullvcL9syof1p7ykgPJhBYwZnEfpgD7075NlAaSHC3ygACgrKWDzPjv1ZExn5WaFufriIq6+2Ot0rz/VyJb9R9m07wib99WzeX89T/5pJ43NZwJwv5wIwwv7MHxAH0oH9GF4YR8mXTCAylED/doN0069IlCUl+SzZkstpxqbW56nbYzpvILcLC4fPZDLR5/5p9/QFGXHgWN8XHeCmkMnqDl0kr2HT1Jz6CRVu+uod6M6f3HsYP7uhkv4/IhCv4pvUtQ7AsWwAqIKWz85apXSmG6WHQlRPsw7/ZTIkRON/MvaPfz8Dx8y69E3mF4xlG/ecDFji/PTXFKTqkBf9RRTVuJV2M37k41laIxJh/59s7jzi2N47b9dw71fupg/7fiUaQ+/zjeff88eDZChekWgGDGgL3nZYaotUBiTMfJzs7jnS2N5/VtTmX/laH63cR/XPvQHvrdyk933lGF6RaAIhYSykgILFMZkoIF52Xzny+W89t+uYfakETz71kdc/cCr/GTVNo6dtqdUZoJeESgAFyiO2uiYxmSokv59+NEtl7Hq3qu4+pIilqzeztUPvMqv3thlQ4n4rFcFimOnm/j6P6+3oQiMyWBjivrx87mT+Le7r+Di4ny+97vNfOnHr7Fiw177oeeTXhMori8v5vJRA3nx/f3Me+pdDp9o8LtIxpgkxo8o5J/+ZjJP//Xl5OVEuGfZBv78Z3/izQ8/a/vDpkv1mkBRlJ/DsgVTmFZRzOvbDvK7jfv9LpIxpg0iwtUXF/Hi16/k4b8az+ETjXzll2/x9X9ezydH7LkZ6dJrAgV4ndq/+E+TKB3Qh9e2HrRTUMb0EKGQcPOE4az+5tXcc91YXtn0Cdc+9Ad+8dqH1n+RBr0qUMCZXyi/r67lP/zg97y7u67tDxljMkJuVph7r7+Y3997NX924WAWv7yF6Q+/zuvbDvpdtEDrFXdmn2vhly5m9OA8nvjjLu58uoqvX3sR5SUFjBqcx8C8bBvmw5gMN3JQX56YV8mrWw7wv3+3iduXvsO0imJunVhKSAQRvBfS8viofjkRLirqx4C8bH8L3wP1ykBRlJ/DnV8cQ1F+Dt9avpF/eLG6ZVlWWLjzi2MYO6Qf0yqGkpfTK78iY3qEqZcO4c8uGsQTf9zFz9bs4JVNtW1+pig/h4uL+3Fxsfe45LHF+VxU1I+cLO8ESyzQhEQQXMDp5aPfSk89T19ZWalVVVVdsq5Pj52men89NYdO8vvNtaze4j3QbGBeNjeOG8q44f0p7JPFFy4cRGFf+zXSG4jI2nOHGU+HrqzXvc1nx06z7/ApFEXVe+asqrY8e/bIyUa21x5lW+0xttUeZXvtMU6m+ACmCSMLuXViKX/+uWH075vVbfvQ3Tpary1QJHDkZCOb99Xzj29/xKrNtS2dZdmREFdeNJjZk0q5sKgfJYW5FOT23EpjWmeBIviiUaXm0Em21R5l92fHaWxWou7/YTTqBZioKqeboqypPsDW2qNkR0JcX1bMrZOGc9XYIiLhntXN29F6nTHnVURkOrAECANPqOpiv8rS37UevnDhIE41NnPoRAP7Dp/kd+/t5/mqPazZcqAl76VD85k+biiTLhjABQPzGJCXRb4FD2MyXigkjBzUl5GD+raZ91vTLmHTvnqWr61hxYa9vPj+fgb3y+Hm8cOYMmYQ4bAQFiEkQijknbYKiRAOQSQUIhIWssIhIiHvPRwSciIhBuZl94jTWhnRohCRMLANuB7vgfTvAl9R1c2tfcavX16HTzSwp+4kOz89Rs2hk/xh6wGqPjpE/NeYnxuhf58sssIhssJCJOS9Z0dC9M2OkJcTJi87Ql5OhL7ZYXIiYbIjoZZXTjhETlaI7PCZtOxwiKyIV9FiFS8cErJCIcJhISvkzccvi7i0nlARM421KExrGpqivLr1AL9ZW8OrWw+c9ZCm9hrQN4vLSgv53PD+jBven8+V9qekf263HbM9vUVxObBDVXcCiMgyYBbQaqDwS2HfbAr7ZnNZaX8A7p56EXXHG9iyv56awyc5dLyBvYdPcux0E43NSlNzlMbmKI3NSkNTlMMnGqg51MSJhmaOnfbem7t5WIJw6OzA4b17wevsdO+XTizQxH4hiXjrOHc6JGd+OZ39K8rrBMRddeJ1DJ6ZFrcwFHdliriLUyR2xUpcWqit9QG48py9rlbW59LKSgq44qLB3frdm+DJjoSYVjGUaRVDOXS8gY/qThBVRVVpjnqnq6KqRKPQrLH/AUpTNEpTs9LYHKUpqpxsaGbrJ0fZuPcIj732Ycv/gcH9srlseH+G5Oe2bDMWN868ez8Os8IhIuEQ2WEhEg61/DidOX7YWZ/vrEwJFMOBPXHzNcBkn8rSbgPzsvmzTvzDaY56QaShKcrp5mZON0ZpaI62pDU0RzndGG2paE1Rr9I1R9XNexUvtqzZzTe3zJ+dfl6+2PJz0uMre2NzlKh6ZVXVlumWg0LdARJVmtV1Jrr4F8t/dicjgEt3HY4amz6nI1Ldut1HWtYTjVveEXMnj7RAYTplQF52l1xue6qxmer99by/9wgba47wfs0RqvcfBbz6DmfX86gqjbGg06w0NJ990+HlowcGMlAkamedd/iLyAJgAcDIkSO7u0xpEw4JfbLD9MkOA9a/0VEtASk+8HAmaMUHqqgqWaGe1RFpgis3K8yEkQOYMHJAhz7vtWa8H3oNzVH6dvG9YJkSKGqAEXHzpcC+czOp6uPA4+Cdy01P0UxPISKEY+eojOlFRLxTxpEw3XLDcKb8pHoXGCsio0UkG5gDrPS5TMYYY8iQFoWqNonI14BX8C6PXaqqm3wuljHGGDLk8tiOEJGDwEcJFg0GPk1zcVpjZUmsJ5TlAlUtSndhktRryKzvLSbTymTlSe4C4DvuNH7KemygaI2IVPlx/XsiVpbErCwdk4llzbQyWXna1pEyZUofhTHGmAxlgcIYY0xSQQwU7Tr31s2sLIlZWTomE8uaaWWy8rSt3WUKXB+FMcaYrhXEFoUxxpguFKhAISLTRWSriOwQkUU+bH+3iLwvIhtEpMqlDRSRVSKy3b137B79tre9VEQOiMgHcWkJty2eR9z3tFFEJnZzOb4nInvd97JBRGbELfu2K8dWEZnWVeVw6x4hIq+KSLWIbBKRe1x62r+XzvK7bicoz3l13YcypFznfSxPq3U/DeVpV/1PSt2ohz39hXej3ofAGCAbeA8oT3MZdgODz0l7AFjkphcB93fTtq8CJgIftLVtYAbwMt5YF1OAt7u5HN8D/i5B3nL3d8oBRru/X7gLy1ICTHTT+XhD2Zf78b10cj98r9sJynReXfehDCnXeR/Lk7Dup6k87ar/yV5BalG0DFWuqg1AbKhyv80CnnbTTwM3d8dGVPV1oC7Fbc8CnlHPW0ChiJR0YzlaMwtYpqqnVXUXsAPv79glVHW/qq5z00eBaryRitP+vXRSptZtX7WzzvtVHt90oP63KkiBItFQ5cPTXAYF/q+IrHUj3QIUq+p+8P5wwJA0lqe1bfvxXX3Nnc5ZGtfUTVs5RGQUMAF4m8z6XlKRieVKVNczgZ/HW2sS1f20SrH+typIgSKlocq72RWqOhG4EbhbRK5K8/ZTle7v6jHgQmA8sB94KJ3lEJF+wG+AhapanyxrOsrTAZlYrp5S1/3WWt1Pm3bU/1YFKVCkNFR5d1LVfe79APBbvFMGtbHTF+79QOtr6HKtbTut35Wq1qpqs6pGgV9y5vRSt5dDRLLwDpLnVPVfXXJGfC/tkHHlaqWuZwI/j7fzJKn7adHO+t+qIAUKX4cqF5E8EcmPTQM3AB+4Msxz2eYBK9JVpiTbXgnc7q7ymQIciTVFu8M55/n/Au97iZVjjojkiMhoYCzwThduV4AngWpV/XHcooz4Xtoho4bhT1LXM4Gfx9t5ktT9dGy7vfW/dX70xndjL/8MvJ79D/FGSEzntsfgXY3yHrAptn1gELAa2O7eB3bT9v8Zr2nbiPcLdH5r28Y7lfGo+57eByq7uRzPuu1sdJW0JC7/d1w5tgI3dvF3ciXeKZqNwAb3muHH99KT63aqdd2HcqRc530sT6t1Pw3laVf9T/ayO7ONMcYkFaRTT8YYY7qBBQpjjDFJWaAwxhiTlAUKY4wxSVmgMMYYk5QFCmOMMUlZoDDGGJOUBQpjjDFJ/X8i2Do7Bm+Z0gAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 4 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "词数(type)：51841\n",
      "[['the', 60960], ['What', 36995], ['of', 33987], ['in', 21785], ['to', 18443], ['was', 17065], ['is', 16198], ['did', 15634], ['what', 13219], ['a', 10753], ['How', 8023], ['Who', 8009], ['and', 7229], ['for', 7200], ['many', 5497], ['are', 5455], ['When', 5367], ['that', 4436], ['were', 4428], ['does', 4331]]\n"
     ]
    }
   ],
   "source": [
    "# TODO: 统计一下qlist中每个单词出现的频率，并把这些频率排一下序，然后画成plot. 比如总共出现了总共7个不同单词，而且每个单词出现的频率为 4, 5,10,2, 1, 1,1\n",
    "#       把频率排序之后就可以得到(从大到小) 10, 5, 4, 2, 1, 1, 1. 然后把这7个数plot即可（从大到小）\n",
    "#       需要使用matplotlib里的plot函数。y轴是词频\n",
    "from collections import Counter\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "qlist, alist = read_corpus('./data/train-v2.0.json')\n",
    "\n",
    "words_cnt = Counter()\n",
    "for text in qlist:\n",
    "    words_cnt.update(text.strip(' .!?').split(' '))\n",
    "\n",
    "value_sort = sorted(words_cnt.values(), reverse=True)\n",
    "plt.subplot(221)\n",
    "plt.plot(value_sort)\n",
    "plt.subplot(222)\n",
    "plt.plot(value_sort[:2000])\n",
    "plt.subplot(223)\n",
    "plt.plot(value_sort[:200])\n",
    "plt.subplot(224)\n",
    "plt.plot(value_sort[:20])\n",
    "plt.show()\n",
    "\n",
    "# 显示词频最高前10词，因为只取高频值，所以value转换时重合的概率较小，即时重合也没有太大影响\n",
    "inverse = dict(zip(words_cnt.values(), words_cnt.keys()))\n",
    "print(\"词数(type)：%d\" % len(words_cnt))\n",
    "print([[inverse[v], v] for v in value_sort[:20]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "本来是想契合下zipf定律的，但是不是非常理想，只能看出形状来，主要原因应该是问答趋于同质化。另外明显看到虚词出现频率很高，所以可能需要去停用词。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 文本预处理\n",
    "在这里我们面对的是英文文本，所以任何对英文适合的技术都可以考虑进来。中文和英文类似，多了分词处理但不需要stemming。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: 对于qlist做文本预处理操作（不要对alist进行操作，因为答案直接调用即可，我们只匹配问题）。 可以考虑以下几种操作：\n",
    "#       1. 停用词过滤 （去网上搜一下 \"english stop words list\"，会出现很多包含停用词库的网页，或者直接使用NLTK自带的）   \n",
    "#       2. 转换成lower_case： 这是一个基本的操作\n",
    "#       3. 去掉一些无用的符号： 比如连续的感叹号！！！， 或者一些奇怪的单词。\n",
    "#       4. 去掉出现频率很低的词：比如出现次数少于10,20....\n",
    "#       5. 对于数字的处理： 分词完只有有些单词可能就是数字比如44，415，把所有这些数字都看成是一个单词，这个新的单词我们可以定义为 \"#number\"\n",
    "#       6. stemming（利用porter stemming): 因为是英文，所以stemming也是可以做的工作\n",
    "#       7. 也可以做下词形还原，但是在这个任务里个人感觉不是很有必要\n",
    "#       请注意，不一定要按照上面的顺序来处理\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import PorterStemmer\n",
    "from nltk.tokenize import word_tokenize\n",
    "import math\n",
    "\n",
    "# 加载nltk自带停用词，该停用词表个人感觉一般，具体到细分领域可能还是需要自己归纳\n",
    "sw = set(stopwords.words('english'))\n",
    "# 个人感觉对于一个问题而言这些词不应该删去\n",
    "sw -= {'who', 'when', 'why', 'where', 'how'}\n",
    "# 这里只是随便去了下符号\n",
    "sw.update(['\\'s', '``', '\\'\\''])\n",
    "ps = PorterStemmer()\n",
    "\n",
    "def text_preprocessing(text):\n",
    "    \"\"\"\n",
    "    对单条文本进行处理。\n",
    "    text: str类型\n",
    "    \n",
    "    return: 分词后的list\n",
    "    \"\"\"\n",
    "    \n",
    "    seg = list()\n",
    "    # 直接用nltk分词\n",
    "    for word in word_tokenize(text):\n",
    "        # 小写化、词干提取\n",
    "        word = ps.stem(word.lower())\n",
    "        # 数值归一\n",
    "        word = '#number' if word.isdigit() else word\n",
    "        # 去停用词\n",
    "        if len(word)>1 and word not in sw:\n",
    "            seg.append(word)\n",
    "    \n",
    "    return seg\n",
    "\n",
    "words_cnt = Counter()\n",
    "qlist_seg = list()\n",
    "for text in qlist:\n",
    "    seg = text_preprocessing(text)\n",
    "    qlist_seg.append(seg)\n",
    "    words_cnt.update(seg)\n",
    "    \n",
    "value_sort = sorted(words_cnt.values(), reverse=True)\n",
    "# 根据Zipf定律计算99%覆盖率下的过滤词频，解释见程序下边\n",
    "min_tf = value_sort[int(math.exp(0.99 * math.log(len(words_cnt))))]\n",
    "for cur in range(len(qlist_seg)):\n",
    "    qlist_seg[cur] = [word for word in qlist_seg[cur] if words_cnt[word] > min_tf]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "词频的过滤阈值是按照Zipf定律来算的，没见别人这么用过，自己感觉还算合理。\n",
    "\n",
    "Zipf's law一个实验定律，按照从最常见到非常见排列，第二常见的频率是最常见频率的出现次数的1/2，第三常见的频率是最常见的频率的1/3，第n常见的频率是最常见频率出现次数的1/n。\n",
    "\n",
    "假设我们文本的词频符合该定律，那么对1/n进行积分得到ln(n)，为了使99%的文本得到覆盖则需ln(x)>0.99\\*ln(n)，n是词type数，x是词频从高到底排列时的阈值分割点，最后x=e^(0.99\\*ln(n))。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['when', 'beyonc', 'start', 'becom', 'popular'], ['area', 'beyonc', 'compet', 'when', 'wa', 'grow'], ['when', 'beyonc', 'leav', 'destini', 'child', 'becom', 'solo', 'singer'], ['citi', 'state', 'beyonc', 'grow'], ['decad', 'beyonc', 'becom', 'famou'], ['group', 'wa', 'lead', 'singer'], ['album', 'made', 'worldwid', 'known', 'artist'], ['who', 'manag', 'destini', 'child', 'group'], ['when', 'beyoncé', 'rise', 'fame'], ['role', 'beyoncé', 'destini', 'child']]\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(qlist_seg[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 文本表示\n",
    "做完关键的预处理过程之后，就需要把每一个文本转换成向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: 把qlist中的每一个问题字符串转换成tf-idf向量, 转换之后的结果存储在X矩阵里。 X的大小是： N* D的矩阵。 这里N是问题的个数（样本个数），\n",
    "#       D是字典库的大小。 \n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer() # 定义一个tf-idf的vectorizer\n",
    "\n",
    "X = vectorizer.fit_transform([' '.join(seg) for seg in qlist_seg])  # 结果存放在X矩阵"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(86821, 14548)\n",
      "input sparsity ratio: 0.9995937135512636\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "def sparsity_ratio(X):\n",
    "    return 1.0 - X.nnz / float(X.shape[0] * X.shape[1])\n",
    "\n",
    "print(X.shape)\n",
    "print(\"input sparsity ratio:\", sparsity_ratio(X))  # 打印出稀疏度(sparsity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对于用户的输入问题，找到相似度最高的TOP5问题，并把5个潜在的答案做返回"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "from queue import PriorityQueue\n",
    "\n",
    "def top5results(input_q):\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 input_q, 返回最有可能的TOP 5问题。这里面需要做到以下几点：\n",
    "    1. 对于用户的输入 input_q 首先做一系列的预处理，然后再转换成tf-idf向量（利用上面的vectorizer)\n",
    "    2. 计算跟每个库里的问题之间的相似度\n",
    "    3. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    q_vector = vectorizer.transform([' '.join(text_preprocessing(input_q))])\n",
    "    # 计算余弦相似度，tfidf默认使用l2范数；矩阵乘法\n",
    "    sim = (X * q_vector.T).toarray()\n",
    "\n",
    "    # 使用优先队列找出top5\n",
    "    pq = PriorityQueue()\n",
    "    for cur in range(sim.shape[0]):\n",
    "        pq.put((sim[cur][0], cur))\n",
    "        if len(pq.queue) > 5:\n",
    "            pq.get()\n",
    "    \n",
    "    pq_rank = sorted(pq.queue, reverse=True, key=lambda x:x[0])\n",
    "    # print([x[0] for x in pq_rank])\n",
    "    top_idxs = [x[1] for x in pq_rank]  # top_idxs存放相似度最高的（存在qlist里的）问题的下表 \n",
    "    \n",
    "    return [alist[i] for i in top_idxs]  # 返回相似度最高的问题对应的答案，作为TOP5答案    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.9999999999999999, 0.9999999999999999, 0.6058098700501424, 0.5955903794475756, 0.5604486086527194]\n",
      "['Chengdu Shuangliu International Airport', 'Chengdu Shuangliu International Airport', 'aerodrome with facilities for flights to take off and land', 'newspapers', 'various gaming sites']\n",
      "[0.7797154765957257, 0.7103241311289762, 0.7038747251719334, 0.6245909883904857, 0.5811739588019266]\n",
      "['Plymouth City Airport', 'aerodrome with facilities for flights to take off and land', 'related', 'After the reunification', 'Nanjing Dajiaochang Airport']\n",
      "[0.9999999999999998, 0.7852110277213404, 0.49331031138548864, 0.4162177525363464, 0.33229596712940707]\n",
      "['Myanmar', 'foreign aid', '10 days', 'the British government', 'The latent heat of water condensation amplifies convection']\n",
      "[0.5947629389746683, 0.39360612204759465, 0.35791876809775003, 0.31237667954304615, 0.2999699083743183]\n",
      "['Myanmar', 'Isabel', 'foreign aid', 'Soviet Union and China', '10 days']\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(top5results(\"Which airport was shut down?\"))    # 在问题库中存在，经过对比，返回的首结果正确\n",
    "print(top5results(\"Which airport is closed?\"))\n",
    "print(top5results(\"What government blocked aid after Cyclone Nargis?\"))    # 在问题库中存在，经过对比，返回的首结果正确\n",
    "print(top5results(\"Which government stopped aid after Hurricane Nargis?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以明显发现，tfidf方法是很难处理语义相似的问题的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 利用倒排表的优化。 \n",
    "上面的算法，一个最大的缺点是每一个用户问题都需要跟库里的所有的问题都计算相似度。假设我们库里的问题非常多，这将是效率非常低的方法。 这里面一个方案是通过倒排表的方式，先从库里面找到跟当前的输入类似的问题描述。然后针对于这些candidates问题再做余弦相似度的计算。这样会节省大量的时间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: 基于倒排表的优化。在这里，我们可以定义一个类似于hash_map, 比如 inverted_index = {}， 然后存放包含每一个关键词的文档出现在了什么位置，\n",
    "#       也就是，通过关键词的搜索首先来判断包含这些关键词的文档（比如出现至少一个），然后对于candidates问题做相似度比较。\n",
    "from collections import defaultdict\n",
    "\n",
    "inverted_idx = defaultdict(set)  # 制定一个一个简单的倒排表\n",
    "for cur in range(len(qlist_seg)):\n",
    "    for word in qlist_seg[cur]:\n",
    "        inverted_idx[word].add(cur)\n",
    "    \n",
    "def top5results_invidx(input_q):\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 input_q, 返回最有可能的TOP 5问题。这里面需要做到以下几点：\n",
    "    1. 利用倒排表来筛选 candidate\n",
    "    2. 对于用户的输入 input_q 首先做一系列的预处理，然后再转换成tf-idf向量（利用上面的vectorizer)\n",
    "    3. 计算跟每个库里的问题之间的相似度\n",
    "    4. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    seg = text_preprocessing(input_q)\n",
    "    candidates = set()\n",
    "    for word in seg:\n",
    "        # 取所有包含任意一个词的文档的并集\n",
    "        candidates = candidates | inverted_idx[word]\n",
    "    candidates = list(candidates)\n",
    "    \n",
    "    q_vector = vectorizer.transform([' '.join(seg)])\n",
    "    # 计算余弦相似度，tfidf用的l2范数，所以分母为1；矩阵乘法\n",
    "    sim = (X[candidates] * q_vector.T).toarray()\n",
    "\n",
    "    # 使用优先队列找出top5\n",
    "    pq = PriorityQueue()\n",
    "    for cur in range(sim.shape[0]):\n",
    "        pq.put((sim[cur][0], candidates[cur]))\n",
    "        if len(pq.queue) > 5:\n",
    "            pq.get()\n",
    "    \n",
    "    pq_rank = sorted(pq.queue, reverse=True, key=lambda x:x[0])\n",
    "    print([x[0] for x in pq_rank])\n",
    "    top_idxs = [x[1] for x in pq_rank]  # top_idxs存放相似度最高的（存在qlist里的）问题的下表 \n",
    "    \n",
    "    return [alist[i] for i in top_idxs]  # 返回相似度最高的问题对应的答案，作为TOP5答案  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1.0, 1.0, 0.6058098700501424, 0.5955903794475756, 0.5604486086527194]\n",
      "['Chengdu Shuangliu International Airport', 'Chengdu Shuangliu International Airport', 'aerodrome with facilities for flights to take off and land', 'newspapers', 'various gaming sites']\n",
      "[0.7797154765957257, 0.7103241311289762, 0.7038747251719334, 0.6245909883904857, 0.5811739588019266]\n",
      "['Plymouth City Airport', 'aerodrome with facilities for flights to take off and land', 'related', 'After the reunification', 'Nanjing Dajiaochang Airport']\n",
      "[0.9999999999999998, 0.7852110277213404, 0.49331031138548864, 0.4162177525363464, 0.33229596712940707]\n",
      "['Myanmar', 'foreign aid', '10 days', 'the British government', 'The latent heat of water condensation amplifies convection']\n",
      "[0.5947629389746683, 0.39360612204759465, 0.35791876809775003, 0.31237667954304615, 0.2999699083743183]\n",
      "['Myanmar', 'Isabel', 'foreign aid', 'Soviet Union and China', '10 days']\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(top5results_invidx(\"Which airport was shut down?\"))    # 在问题库中存在，经过对比，返回的首结果正确\n",
    "print(top5results_invidx(\"Which airport is closed?\"))\n",
    "print(top5results_invidx(\"What government blocked aid after Cyclone Nargis?\"))    # 在问题库中存在，经过对比，返回的首结果正确\n",
    "print(top5results_invidx(\"Which government stopped aid after Hurricane Nargis?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于词向量的文本表示\n",
    "上面所用到的方法论是基于词袋模型（bag-of-words model）。这样的方法论有两个主要的问题：1. 无法计算词语之间的相似度  2. 稀疏度很高。 接下来我们采用词向量作为文本的表示，下载训练好的d=100（100维）的词向量[glove.6B.zip](https://nlp.stanford.edu/projects/glove/)。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/six/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: DeprecationWarning: Call to deprecated `wv` (Attribute will be removed in 4.0.0, use self instead).\n",
      "/home/six/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:27: RuntimeWarning: invalid value encountered in true_divide\n"
     ]
    }
   ],
   "source": [
    "# TODO\n",
    "# 读取每一个单词的嵌入。\n",
    "# 句子向量 = 词向量的平均（出现在问句里的）， 如果给定的词没有出现在词典库里，则忽略掉这个词。\n",
    "from gensim.models import KeyedVectors\n",
    "from gensim.scripts.glove2word2vec import glove2word2vec\n",
    "import numpy as np\n",
    "\n",
    "# 将GloVe转为word2vec\n",
    "_ = glove2word2vec('./data/glove.6B.100d.txt', './data/glove2word2vec.6B.100d.txt')\n",
    "model = KeyedVectors.load_word2vec_format('./data/glove2word2vec.6B.100d.txt')\n",
    "\n",
    "def docvec_get(seg):\n",
    "    \"\"\"\n",
    "    将分词数据转为句向量。\n",
    "    seg: 分词后的数据\n",
    "    \n",
    "    return: 句向量\n",
    "    \"\"\"\n",
    "    vector = np.zeros((1, 100))\n",
    "    size = len(seg)\n",
    "    for word in seg:\n",
    "        try:\n",
    "            vector += model.wv[word]\n",
    "        except KeyError:\n",
    "            size -= 1\n",
    "    \n",
    "    return vector / size\n",
    "\n",
    "X = np.zeros((len(qlist_seg), 100))\n",
    "for cur in range(X.shape[0]):\n",
    "    X[cur] = docvec_get(qlist_seg[cur])\n",
    "\n",
    "# 计算X每一行的l2范数\n",
    "Xnorm2 = np.linalg.norm(X, axis=1, keepdims=True)\n",
    "X = X / Xnorm2\n",
    "\n",
    "def top5results_emb(input_q):\n",
    "    \"\"\"\n",
    "    给定用户输入的问题 input_q, 返回最有可能的TOP 5问题。这里面需要做到以下几点：\n",
    "    1. 利用倒排表来筛选 candidate\n",
    "    2. 对于用户的输入 input_q，转换成句子向量\n",
    "    3. 计算跟每个库里的问题之间的相似度\n",
    "    4. 找出相似度最高的top5问题的答案\n",
    "    \"\"\"\n",
    "    # 用词向量后用词形还原更合理，此处就不做变更了\n",
    "    seg = text_preprocessing(input_q)\n",
    "    # 直接用上边建好的倒排表\n",
    "    candidates = set()\n",
    "    for word in seg:\n",
    "        # 取所有包含任意一个词的文档的并集\n",
    "        candidates = candidates | inverted_idx[word]\n",
    "    candidates = list(candidates)\n",
    "    \n",
    "    q_vector = docvec_get(seg)\n",
    "    # 计算问题向量的l2范数\n",
    "    qnorm2 = np.linalg.norm(q_vector, axis=1, keepdims=True)\n",
    "    q_vector = q_vector / qnorm2\n",
    "    # 计算余弦相似度，前边已经l2规范化过，所以直接相乘\n",
    "    sim = (X[candidates] @ q_vector.T)\n",
    "\n",
    "    # 使用优先队列找出top5\n",
    "    pq = PriorityQueue()\n",
    "    for cur in range(sim.shape[0]):\n",
    "        pq.put((sim[cur][0], candidates[cur]))\n",
    "        if len(pq.queue) > 5:\n",
    "            pq.get()\n",
    "    \n",
    "    pq_rank = sorted(pq.queue, reverse=True, key=lambda x:x[0])\n",
    "    print([x[0] for x in pq_rank])\n",
    "    top_idxs = [x[1] for x in pq_rank]  # top_idxs存放相似度最高的（存在qlist里的）问题的下表 \n",
    "    \n",
    "    return [alist[i] for i in top_idxs]  # 返回相似度最高的问题对应的答案，作为TOP5答案  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1.0, 1.0, 0.8875869259222657, 0.8826214612899687, 0.8355808887273302]\n",
      "['Chengdu Shuangliu International Airport', 'Chengdu Shuangliu International Airport', 'Terminal C', 'Nanjing Dajiaochang Airport', '1967']\n",
      "[0.9454294862808651, 0.9029611996952854, 0.9029611996952854, 0.9029611996952854, 0.8917413888585661]\n",
      "['Plymouth City Airport', 'southern suburbs of Paris', 'within the departure areas', 'India', 'Dushanbe International Airport']\n",
      "[0.9999999999999999, 0.852360897734157, 0.8518187365307014, 0.8508247887568896, 0.8409244964740952]\n",
      "['Myanmar', 'most Protestants (and most Jews)', 'lower house of parliament', 'the Tzu Chi Foundation', 'started an anti-separatist campaign']\n",
      "[0.8828545495470352, 0.8348415264745357, 0.816676060212699, 0.8107728682697369, 0.7993383778232651]\n",
      "['Myanmar', 'the Tzu Chi Foundation', 'started an anti-separatist campaign', 'public gaze', 'most Protestants (and most Jews)']\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/six/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: DeprecationWarning: Call to deprecated `wv` (Attribute will be removed in 4.0.0, use self instead).\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(top5results_emb(\"Which airport was shut down?\"))    # 在问题库中存在，经过对比，返回的首结果正确\n",
    "print(top5results_emb(\"Which airport is closed?\"))\n",
    "print(top5results_emb(\"What government blocked aid after Cyclone Nargis?\"))    # 在问题库中存在，经过对比，返回的首结果正确\n",
    "print(top5results_emb(\"Which government stopped aid after Hurricane Nargis?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于第二个问题答得不好表示有点失望，不过同时试了些其他问题，整体还不错。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
